[wip] Distributed Scion/Muon #1630
base: main
Conversation
Thanks for the PR on cutting-edge features!
I haven't read the papers, so please forgive me if my comments don't make sense.
I think for "core" changes such as this one on optimizers, the recommended path is to first land the work in pytorch/pytorch, and then expose minimal interfaces in torchtitan. torchtitan shouldn't be a place to host core features.
cc @janeyx99 on interesting optimizer work
Update: the init refactor is done. You can check the diff here. I have added the debug configs, so you can try it now. There is also a "clean" version where I removed the logging code, making it easier to read and understand. Note: in a random test, Scion uses a higher learning rate; Muon/Scion allows us to train a model with a high LR.
Most of the features are ready, so we can start reviewing them.
This is a distributed version of Scion (the Modular Norm approach); Muon can be considered a variant of it that uses explicit AdamW for the LLM's embedding and output layers.
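For readers unfamiliar with the split between the two update rules, here is a minimal, single-device sketch (not the distributed implementation in this PR): a Muon-style optimizer applies a momentum step followed by Newton-Schulz orthogonalization to 2D hidden weight matrices, while embedding/output (and other) parameters are handled by a separate AdamW instance. The names `SimpleMuon` and `newton_schulz_orthogonalize`, the coefficients, and the parameter grouping are illustrative assumptions, not the PR's API.

```python
# Hypothetical sketch only; torchtitan/this PR implement a distributed version.
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update via a Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # commonly cited Muon coefficients (assumption)
    x = g / (g.norm() + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:  # keep the smaller dimension first so A = x @ x.T stays small
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x


class SimpleMuon(torch.optim.Optimizer):
    """Minimal Muon-like optimizer: momentum buffer + orthogonalized update."""

    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(p.grad)
                update = newton_schulz_orthogonalize(buf)
                p.add_(update, alpha=-group["lr"])


# Parameter grouping: 2D hidden weights go to the Muon-like optimizer,
# embedding/output weights and biases go to AdamW (illustrative model).
model = torch.nn.Sequential(
    torch.nn.Embedding(1000, 64),
    torch.nn.Linear(64, 64),
    torch.nn.Linear(64, 1000),
)
embed_and_head = set(model[0].parameters()) | set(model[2].parameters())
muon_params = [p for p in model.parameters()
               if p.ndim == 2 and p not in embed_and_head]
adamw_params = [p for p in model.parameters() if p not in set(muon_params)]

muon_opt = SimpleMuon(muon_params, lr=0.02)
adamw_opt = torch.optim.AdamW(adamw_params, lr=3e-4)
```

In a real distributed setting the gradients are sharded (e.g. under FSDP/TP), so the orthogonalization step has to gather or operate on full matrices, which is the part this PR addresses; the sketch above ignores that entirely.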
Works:
We may still need to check ETP, and it's not yet working with multiple shared_experts.
CC @janEbert @ofivite